Analysis and interpretation of joint source separation and sound event detection in domestic environments

In recent years, the relation between Sound Event Detection (SED) and Source Separation (SSep) has received growing interest, in particular with the aim of enhancing the performance of SED by leveraging the synergies between both tasks. In this paper, we present a detailed description of JSS (Joint Source Separation and Sound Event Detection), our joint-training scheme for SSep and SED, and we measure its performance in the DCASE Challenge for SED in domestic environments. Our experiments demonstrate that JSS can improve SED performance, in terms of Polyphonic Sound Detection Score (PSDS), even without additional training data. Additionally, we conduct a thorough analysis of JSS’s effectiveness across different event classes and in scenarios with severe event overlap, where it is expected to yield further improvements. Furthermore, we introduce an objective measure to assess the diversity of event predictions across the estimated sources, shedding light on how different training strategies impact the separation of sound events. Finally, we provide graphical examples of the Source Separation and Sound Event Detection steps, aiming to facilitate the interpretation of the JSS methods.


Introduction
An important part of the information we obtain from our surroundings is carried by sound, helping us to understand where we are or what is happening around us. With this motivation, several research fields in audio signal processing aim to exploit the contents of sound signals to retrieve information about the environment. For instance, Sound Event Detection (SED) [1] answers the questions of which specific events occur in an audio recording, and when they begin and end. Other related tasks are Acoustic Scene Classification [2], which labels an audio recording according to the environment in which it has been captured (e.g. house, park, train station), and Automated Audio Captioning [3], which aims to provide a text description of the recording.
With the objective of supporting the research in SED and other environmental sound analysis tasks, yearly challenges are hosted by the DCASE community (Detection and Classification of Acoustic Scenes and Events [4]), suggesting research questions and offering standard frameworks in which different approaches can be compared. Among the focuses of the SED task in recent DCASE challenges, the main one is the exploitation of unlabeled or partially-annotated data to train SED systems [5], whereas complementary lines of research have also been proposed, such as the use of Source Separation (SSep) to aid Sound Event Detection [6].
The interest in leveraging data with different degrees of annotation is related to the high cost of data curation and annotation, in comparison to the wide availability of unlabeled data. An illustrative example is AudioSet [7], a large-scale acoustic event dataset containing more than 2 million 10-second audio clips, which has increased the viability of deep learning algorithms in SED. The audio clips in AudioSet are extracted from YouTube videos, and are provided with weak annotations (i.e. clip-level labels) or strong annotations [8] (i.e. timestamps of the onset and offset of each event), according to a comprehensive ontology that organizes and describes more than 500 categories of acoustic events. A similar distribution of data annotations is observed in DESED (Domestic Environment Sound Event Detection) [9,10], the dataset employed in the DCASE challenge, which is composed of weakly-labeled and strongly-labeled audio clips, in addition to a majority of unlabeled examples.
With the aim of leveraging unlabeled examples and reducing the dependency on annotated data, semi-supervised and unsupervised learning methods have been developed, and both remain active fields of research. Unsupervised learning uses only unlabeled data, whereas semi-supervised learning (SSL) leverages both labeled and unlabeled examples, reducing the amount of necessary annotations [11]. In the context of DCASE challenges, the most common SSL algorithm for SED is Mean Teacher, which considers a moving average version (teacher) of the original model (student), and then incorporates the consistency between student and teacher predictions as part of the loss function used to train the student model [12].
Another research interest in DCASE is the use of Source Separation (SSep) to enhance Sound Event Detection. SSep aims to automatically decompose audio mixtures into their underlying components, considering that each component has been produced by a different acoustic source. For example, SSep can isolate the speech signal in a recording that contains speech and background noise, or separate the different instruments in a music mixture. Therefore, the application of SSep to SED serves the purpose of decomposing audio event mixtures into several audio channels, each of them containing lower levels of noise or less overlap of target events, and thus being more adequate inputs to the SED system.
Following this approach, the first method proposed by DCASE involved convolutional masking networks trained for SSep as a pre-processing step for SED [6]. A late integration of SSep and SED, combining the SED outputs for the mixture and separated sound sources, was observed to be more beneficial than an early integration (concatenating the separated sources and the mixture as a multi-channel input to the SED model), or a middle integration (concatenating intermediate representations). However, the method provided limited improvements, which was explained by a mismatch between the training data of the SSep model (artificial sound mixtures) and the test data of SED (web audio). For the 2021 challenge, DCASE proposed a baseline system that also used a late fusion of pre-trained SSep and SED, but with a Source Separation model trained over web audio by means of Mixture Invariant Training (MixIT) [13], an unsupervised method for training SSep neural networks.
More recent works have explored the combination of Source Separation and Sound Event Detection or classification. Some of them, in a similar way to DCASE approaches, use Source Separation to enhance the performance of Sound Event Detection systems. Examples include training a Source Separation network from scratch using a task-aware objective [14], or encouraging a SED system to separate sources in its intermediate representations [15]. In contrast, other works aim for the opposite direction: employing the information provided by sound classifiers to aid the separation of mixtures into semantically different sources [16][17][18].
Considering that Source Separation can improve Sound Event Detection, and that the information provided by a SED system might be helpful for Source Separation, a training setting which encourages a bidirectional flow of information between SSep and SED seems to be an interesting approach. In previous work [19], we proposed Joint Source Separation + Sound Event Detection (JSS), a joint model in which a Source Separation block is connected to a Sound Event Detection block. After pre-training each block independently for its corresponding task, the whole model is trained end-to-end using Sound Event Detection objectives, either together (Joint Training) or in a separate stage for each block (Two-stage Training). Moreover, the development of JSS involved an exploration of the model selection strategy employed for the semi-supervised Mean Teacher models, which is especially relevant for iterative training processes. We found that our methods were able to improve SED performance in the context of DCASE Challenge Task 4, especially when the Source Separation block was pre-trained using in-domain data.
Building upon previous research, this paper introduces the following main contributions: (1) We offer a comprehensive definition of the JSS method and its different variants (Joint Training and Two-stage Training), including the proposed model selection criterion for Mean Teacher, which is shown to improve the results. (2) We provide results and analysis of JSS over two additional datasets: DESED Public evaluation, and Public overlap (an overlapped version of the aforementioned dataset, introduced in our previous work [20]). (3) We analyze and discuss several aspects of the JSS method which were not covered by previous works, including its performance for specific sound event categories, and the role of Source Separation in the system, which we measure by means of the similarity of SED predictions across estimated sources. (4) Finally, we offer graphical representations of the intermediate steps of the systems, which aid understanding of the interactions between Sound Event Detection and Source Separation and enhance the interpretability of the resulting systems.

Sound Event Detection
Sound Event Detection aims to obtain, for an input audio signal $x$, the time boundaries $(t_{on}, t_{off})$ for a closed set of $K$ acoustic event classes (Fig 1). A common approach is to consider the problem as $K$ binary classification tasks in time, so that a detection score sequence $\hat{d}_k \in (0, 1)^T$ of length $T$ is estimated for each event category. This set of $K$ score sequences forms a matrix $\hat{D} \in (0, 1)^{K \times T}$.
A usual approach for neural network SED systems is to obtain $\hat{D}$ by means of a $K$-dimensional output layer with sigmoid activation. In such case, the score sequences $\hat{D}$ are a function of the input signal $x$ and the model parameters $\theta_{sed}$ (Eq (1)).
Once $\hat{D}$ is computed, the values of the onset and offset times can be determined by defining a threshold $\tau \in (0, 1)$. Then, for each event category $k \in [1, K]$, the onsets $t_{on,k}$ are the times when the score goes above the threshold ($\hat{d}_k(t_{on,k}) \geq \tau$, $\hat{d}_k(t_{on,k} - 1) < \tau$), and the offsets $t_{off,k}$ correspond to the time frames when the score becomes lower than the threshold ($\hat{d}_k(t_{off,k}) < \tau$, $\hat{d}_k(t_{off,k} - 1) \geq \tau$). As a post-processing step, a median filter is applied after thresholding, in order to avoid spurious onsets and offsets.
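The thresholding and post-processing steps above can be illustrated with a minimal sketch for a single event class. The function names, the zero-padding at the sequence edges, the window of 3 frames, and the example scores are assumptions made for illustration only; they are not taken from the DCASE implementation.

```python
from statistics import median


def binarize(scores, threshold):
    """Mark frames whose score reaches the threshold (1) or not (0)."""
    return [1 if s >= threshold else 0 for s in scores]


def median_filter(decisions, window):
    """Median-filter a binary sequence; `window` must be odd, edges are zero-padded."""
    half = window // 2
    padded = [0] * half + list(decisions) + [0] * half
    return [median(padded[t:t + window]) for t in range(len(decisions))]


def find_events(decisions):
    """Return (onset, offset) frame pairs; the offset is the first inactive frame."""
    events = []
    onset = None
    for t, active in enumerate(decisions):
        if active and onset is None:
            onset = t  # score reached the threshold at t but not at t - 1
        elif not active and onset is not None:
            events.append((onset, t))  # score fell below the threshold at t
            onset = None
    if onset is not None:
        events.append((onset, len(decisions)))  # event still active at the end
    return events


# Hypothetical score sequence for one class (illustrative values only).
scores = [0.1, 0.7, 0.2, 0.8, 0.9, 0.85, 0.3, 0.1]
decisions = median_filter(binarize(scores, threshold=0.5), window=3)
events = find_events(decisions)
print(decisions, events)
```

In this toy sequence the isolated one-frame detection is merged with the following event by the median filter, which is the kind of spurious onset and offset that the filter is meant to suppress.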

Sound Event Detection in DCASE challenge
The scope of Sound Event Detection in the DCASE Challenge Task 4 is focused on domestic environments, which are especially relevant for indoor applications such as home assistance or security. To this end, a set of 10 event categories drawn from the AudioSet ontology is considered: Alarm/bell/ringing, Blender, Cat, Dishes, Dog, Electric shaver/toothbrush, Frying, Running water, Speech, and Vacuum cleaner. An example of a mel-spectrogram representation of each event category is provided in Fig 2.
Several research questions have been stated in the recent editions of DCASE Challenge Task 4, most of them regarding the use of different types of training data, such as a large amount of unlabeled audio clips from web videos or strongly-labeled synthetic recordings [21], in addition to a small set of weakly-labeled data. For this purpose, semi-supervised learning approaches were the main focus, with Mean Teacher [12] becoming a standard approach thanks to its simplicity and good results [22].
A SED Baseline system is provided each year by the challenge, aiming to establish a performance benchmark, and including some advances of the state of the art. The current baseline is a Convolutional-Recurrent Neural Network (CRNN) [23].
In 2020, the challenge proposed Source Separation for SED as an auxiliary task, called Sound Event Separation and Detection (SSep+SED). This task involved the use of SSep systems to separate overlapping sound events and extract foreground sound events from the background sound, and it introduced an additional Baseline system, described in Fig 3. In order to train the SSep+SED Baseline system, the training data is first separated using a pre-trained Source Separation network. Then, the SED baseline system, already trained over the original mixtures, is fine-tuned to the separated data. In order to obtain the final score sequences, the outputs of the fine-tuned SED system over separated sources, $\hat{D}_{sep}$, are combined with the outputs of the pre-trained SED Baseline over the mixtures, $\hat{D}_{mix}$, as described in Eq (2). The combination weight, $q$, is learnt during the fine-tuning process.
In the DCASE SSep+SED Baseline, both the SSep network and the SED model that is applied over the mixtures are frozen, meaning that their parameters are not updated during the training process.

Semi-supervised Sound Event Detection with Mean Teacher
The scarcity of annotations in Sound Event Detection training data can be addressed by means of a semi-supervised learning algorithm. In particular, Mean Teacher is the method adopted by the DCASE Baseline system. Mean Teacher training considers two models, student and teacher, with identical structure. The weights of the student model, $\theta^{(s)}$, are trained with back-propagation using a loss function $L_{sed}$, whereas the weights of the teacher model, $\theta^{(t)}$, are computed at each training step $n$ as an exponential moving average (EMA) of $\theta^{(s)}$ (Eq (3)). The weight $\alpha_{ema} \in (0, 1)$ is a hyperparameter that determines the exponential decay, with higher values resulting in slower updates of the teacher model. Its optimal value can be obtained empirically.
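As a minimal sketch of the teacher update, the snippet below assumes the standard EMA form, in which the previous teacher weights are scaled by the EMA factor and the current student weights by its complement. This form is consistent with the statement that higher values of the factor slow down the teacher, but the exact expression of Eq (3) is not reproduced in this text, so it should be read as an assumption. Weights are represented as flat lists of floats, and the value 0.5 is chosen only to keep the arithmetic easy to follow.

```python
def ema_update(teacher, student, alpha_ema):
    """Return new teacher weights: alpha * old teacher + (1 - alpha) * student."""
    return [alpha_ema * t + (1.0 - alpha_ema) * s for t, s in zip(teacher, student)]


# Toy weights (illustrative values, not taken from any trained model).
teacher = [0.0, 0.0]
student = [1.0, -2.0]

# With a constant student, the teacher approaches the student geometrically.
for _ in range(3):
    teacher = ema_update(teacher, student, alpha_ema=0.5)

print(teacher)
```

With a factor close to 1, such as the value used later for the SED baseline, each step moves the teacher only slightly towards the student.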
In order to leverage the information provided by both labeled and unlabeled data, the loss function $L_{sed}$ is divided into two components:
• A supervised loss ($L_{sup}$, Eq (4)), which is implemented as a Binary Cross-Entropy between the student score sequences ($\hat{D}^{(s)}$) and the ground truth annotations, $D$.
• A self-supervised consistency loss ($L_{self}$, Eq (5)), computed as the Mean Squared Error between student and teacher predictions.
Whereas $L_{sup}$ can only be computed for labeled examples, $L_{self}$ does not require ground truth annotations. This allows the models to learn from all training examples.
The global loss function for Sound Event Detection, $L_{sed}$, is computed as a weighted sum of both components (Eq (6)), and used to train the student model. The weight of the self-supervised loss ($\alpha_{self}$) regulates the contribution of the consistency measure.
Therefore, a feedback loop is created between student and teacher: the student model is trained to minimize a loss function that considers consistency with teacher predictions, while the teacher is computed as a smoothed (averaged) version of the student.
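The two loss components can be sketched as follows for a single flattened score sequence. The combination used here (supervised term plus the consistency term scaled by its weight) is one reading of the weighted sum described above and is an assumption, as are the function names and the example values. In an actual training loop the teacher predictions would be treated as constants, so that gradients only flow through the student.

```python
import math


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy between scores in (0, 1) and 0/1 targets."""
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(pred)


def mean_squared_error(a, b):
    """Mean squared difference between two score sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mean_teacher_loss(student_scores, teacher_scores, labels, alpha_self):
    """Supervised term (labeled clips only) plus weighted consistency term."""
    l_self = mean_squared_error(student_scores, teacher_scores)
    l_sup = binary_cross_entropy(student_scores, labels) if labels is not None else 0.0
    return l_sup + alpha_self * l_self


# Hypothetical predictions for two frames of one class.
student = [0.9, 0.2]
teacher = [0.8, 0.1]

labeled_loss = mean_teacher_loss(student, teacher, labels=[1.0, 0.0], alpha_self=2.0)
unlabeled_loss = mean_teacher_loss(student, teacher, labels=None, alpha_self=2.0)
print(labeled_loss, unlabeled_loss)
```

The unlabeled clip still contributes through the consistency term, which is what allows all training examples to be used.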

Metrics and evaluation of Sound Event Detection systems
$F_1$-score-based metrics. The $F_1$-score (Eq (7)) is a widely adopted metric to measure the performance of SED systems. It can be computed as the harmonic mean of Precision (P) and Recall (R), defined in Eqs (8) and (9) respectively. Thus, taking into account the definitions of Precision and Recall, the $F_1$-score ultimately depends on the number of True Positive (TP), False Negative (FN), and False Positive (FP) decisions of the system.
In DCASE Task 4, the main $F_1$ score is event-based (or collar-based), meaning that the decisions of the system are measured for entire occurrences of an event (TP, FN) or a prediction (FP), considering a certain tolerance between the system predictions and the ground truth annotations.
Polyphonic Sound Detection Score. In recent editions of DCASE Task 4, the Polyphonic Sound Detection Score (PSDS) [24] has been proposed as a performance metric for SED. The aim of PSDS is to address some limitations of $F_1$ scoring, particularly the dependence on a single decision threshold, the lack of robustness to annotation subjectivity, and the agnosticism to cross-triggered detections.
To handle these problems, PSDS is defined as the area under a curve determined by the performance of the system at different thresholds, considering a True Positive Rate and a False Positive Rate. In contrast with event-based and segment-based $F_1$, TPs and FPs follow an intersection criterion, which is more robust to variability of the ground truth labels. Finally, cross-triggers are defined in PSDS as FP decisions that coincide with a different target category, and are considered as a different kind of error.
An additional contribution of PSDS is the definition of several parameters that allow the metric to be adapted to different applications or scenarios. The Detection Tolerance Criterion (DTC) and Ground Truth Intersection Criterion (GTC) define the amount of intersection between predictions and annotations that is necessary to consider a correct detection, and the Cross-Trigger Tolerance Criterion (CTTC) establishes the threshold for cross-triggered decisions. The penalty introduced by cross-triggers ($\alpha_{CT}$) and a cost for instability between classes ($\alpha_{ST}$) can also be configured.
DCASE Task 4 proposes two different configurations for PSDS. The first scenario (PSDS1) encourages systems to make a finer temporal segmentation of the events, whereas the second scenario (PSDS2) is more focused on classification accuracy. Both of them are computed using 50 threshold values, linearly distributed from 0 to 1. The parameter settings for each scenario are described in Table 1.
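The dependence of the F1-score on the TP, FP and FN counts can be made concrete with a short sketch. The function name and the counts are illustrative; how the decisions are counted (for instance, with collars in the event-based variant) is outside the scope of this snippet.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute Precision, Recall and their harmonic mean from decision counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 6 correct detections, 2 false alarms, 4 missed events.
p, r, f1 = precision_recall_f1(tp=6, fp=2, fn=4)
print(p, r, f1)
```

The same value follows directly from the counts as 2TP / (2TP + FP + FN), here 12 / 18.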

Source separation
Acoustic Source Separation (or Source Separation) can be stated as a regression task, in which an input sound mixture $x$ is decomposed into $M$ estimates, $\hat{S} = \langle \hat{s}_m \rangle$, $1 \leq m \leq M$, of its underlying source components (Fig 4). Additionally, a consistency constraint is usually applied, so that the sum of the output sources is equivalent to the input mixture, $\sum_{m=1}^{M} \hat{s}_m = x$. Considering a separation model $f^{(sep)}$, with parameters $\theta_{sep}$, the estimate is obtained as shown in Eq (10).
Given that the goal is to reproduce the reference sources, SSep models can be trained employing a negative Signal-to-Noise Ratio (SNR) loss (Eq (11)), in which the target signals are the reference sources $s$, and the noise is the error $s - \hat{s}$. A small quantity $\epsilon$ is added to this term, to prevent a division by zero in the case that $\hat{s} = s$.
Several tasks involve Acoustic Source Separation with different kinds of audio mixtures, or different constraints on the acoustic sources of interest. Some examples are music source separation [25], which aims to isolate the different instruments present in a music signal; speech separation [26], in which a mixture of various speakers is divided into individual speech signals for each speaker; and speech enhancement [27], which consists of improving the quality of a noisy speech signal by removing the non-speech content, thus separating the input mixture into a speech-only channel and a non-speech channel.
An additional application is Universal Source Separation, defined as the decomposition of any acoustic soundscape into sounds of arbitrary types [28]. In this scenario, some particularities have to be considered: for instance, the order of the output signals should not be relevant (permutation problem), and the number of sources is inherently unknown. Moreover, in order for a model to learn the separation of arbitrary classes, the training data should include a great diversity of sounds.
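A minimal sketch of a negative SNR loss for one source is given below, assuming that the small constant is added to the error power in the denominator, as the description above suggests. The function name, the default value of the constant, and the example signals are illustrative, and the sketch assumes a non-silent reference (silent targets require a separate treatment).

```python
import math


def neg_snr_loss(reference, estimate, eps=1e-8):
    """Negative SNR in dB: the reference is the signal, reference - estimate the noise."""
    signal_power = sum(s * s for s in reference)
    error_power = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    # eps keeps the ratio finite when the estimate equals the reference.
    return -10.0 * math.log10(signal_power / (error_power + eps))


# Toy signals: the estimate is the reference scaled by 0.9 (about 20 dB SNR).
reference = [1.0, 0.0, -1.0, 0.0]
close_estimate = [0.9, 0.0, -0.9, 0.0]
poor_estimate = [0.5, 0.0, -0.5, 0.0]

close_loss = neg_snr_loss(reference, close_estimate)
poor_loss = neg_snr_loss(reference, poor_estimate)
print(close_loss, poor_loss)
```

Lower (more negative) values correspond to better reconstructions, so minimizing this loss maximizes the SNR of the estimate.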

Supervised Source Separation with Permutation Invariant Training
Permutation Invariant Training (PIT) was introduced as a solution to the permutation problem [29]. Considering a loss function $L_{sep}(S, \hat{S})$, PIT compares all the possible permutations of the estimated sources $\hat{S}$ (using the permutation matrix $P$) with the targets $S$, and takes the minimum loss value as the result, thus making the training process independent of the order of the outputs. The PIT loss is thus defined in Eq (12).
In order for a SSep model with a fixed number of outputs ($M$) to deal with a variable number of target sources during training, a new loss function is proposed in [30], dividing $L_{sep}$ into two SNR-based terms: an active loss ($L_a$) and an inactive loss ($L_0$). When applying PIT, the active loss is derived from the negative SNR loss (Eq (11)), and computed with respect to the $M_a$ active target sources ($M_a \leq M$), as shown in Eq (13), whereas the inactive SNR loss is applied with respect to $M - M_a$ null signals, aiming to minimize the separated source power of inactive channels (Eq (14)).
The parameter $\tau$ in Eqs (13) and (14) is introduced as a soft threshold to determine the maximum SNR value, aiming to prevent large gradients from dominating the total loss.
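The permutation search can be sketched with a brute-force loop over all orderings of the estimated sources. This is an illustration under stated assumptions: the per-source loss is a plain negative SNR, the per-permutation loss is the mean over sources, and all targets are active. The function names and toy signals are hypothetical.

```python
import math
from itertools import permutations


def neg_snr(reference, estimate, eps=1e-8):
    """Negative SNR in dB between a reference source and an estimate."""
    signal_power = sum(s * s for s in reference)
    error_power = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    return -10.0 * math.log10(signal_power / (error_power + eps))


def pit_loss(references, estimates):
    """Return the minimum mean loss over all permutations, and that permutation."""
    best_loss, best_perm = math.inf, None
    for perm in permutations(range(len(estimates))):
        # perm[i] is the index of the estimate matched to reference i.
        loss = sum(neg_snr(ref, estimates[j]) for ref, j in zip(references, perm))
        loss /= len(references)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: the model outputs the two sources in swapped order.
references = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
estimates = [[0.0, 0.9, 0.0, 1.1], [1.1, 0.0, 0.9, 0.0]]

loss, perm = pit_loss(references, estimates)
print(loss, perm)
```

The number of permutations grows factorially with the number of outputs, which is manageable for the small source counts typical of these models.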
Thus, the PIT loss with a variable number of target sources is defined in Eq (15).
Regarding the availability of training data for Universal Source Separation, the main limitation is the access to the reference sources of the mixtures, which are generally not possible to obtain. However, it is possible to create artificial training mixtures by overlapping several isolated sources, which can then be used as targets, as in the case of the FUSS dataset (Free Universal Source Separation) [30].

Unsupervised Source Separation with Mixture Invariant Training
A different approach is to train Source Separation in an unsupervised fashion. In this way, reference sources are not necessary to train the models, allowing the use of non-artificial datasets. This is the motivation of Mixture Invariant Training (MixIT), an unsupervised learning algorithm for Source Separation based on the use of "mixtures of mixtures". MixIT proposes to use the sum of two audio mixtures, $x_1 + x_2$, as input to the model, so that the $M$ outputs of the model, $\hat{S}$, are expected to contain the underlying sources of $x_1$ and $x_2$ (Eq (16)).
Given that reference sources for the mixtures are not available, MixIT computes all possible assignments of each output to either $x_1$ or $x_2$, by means of a binary assignment matrix $A$, with size $(2, M)$, in which each column sums up to one. The MixIT loss is then computed in Eq (17) as the minimal loss considering every possible assignment between outputs and inputs.
Whereas this method overcomes the need for target sources to train a separation model, which is the main limitation of PIT, it raises some problems, the main one being a tendency for over-separation, in which a single source is decomposed into several output signals, generally leading to more active output sources than necessary. This occurs because the MixIT loss is blind to the content of the individual estimated sources $\hat{s}_m$, as long as a good reconstruction of the input mixtures is possible. More recent additions to MixIT have dealt with this problem by including penalties for over-separation, such as sparsity loss or covariance loss [18].
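The assignment search can be sketched by enumerating, for each output, which of the two input mixtures it is added to. This is equivalent to enumerating the binary matrices of size (2, M) whose columns sum to one. The negative SNR used as the per-mixture loss, the function names, and the toy signals are assumptions for illustration.

```python
import math
from itertools import product


def neg_snr(reference, estimate, eps=1e-8):
    """Negative SNR in dB between a reference signal and an estimate."""
    signal_power = sum(s * s for s in reference)
    error_power = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    return -10.0 * math.log10(signal_power / (error_power + eps))


def mixit_loss(mix1, mix2, estimates):
    """Return the minimum remix loss over all output-to-mixture assignments."""
    best_loss, best_assign = math.inf, None
    n = len(mix1)
    # assign[m] = 0 sends output m to mix1, assign[m] = 1 sends it to mix2.
    for assign in product((0, 1), repeat=len(estimates)):
        remix = [[0.0] * n, [0.0] * n]
        for source, target in zip(estimates, assign):
            for t, value in enumerate(source):
                remix[target][t] += value
        loss = neg_snr(mix1, remix[0]) + neg_snr(mix2, remix[1])
        if loss < best_loss:
            best_loss, best_assign = loss, assign
    return best_loss, best_assign


# Toy example with M = 3 outputs; the first mixture is over-separated into two.
mix1 = [1.0, 1.0, 0.0, 0.0]
mix2 = [0.0, 0.0, 2.0, 2.0]
estimates = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 2.0, 2.0]]

loss, assign = mixit_loss(mix1, mix2, estimates)
print(loss, assign)
```

The example also shows why the loss cannot detect over-separation: the first mixture is split across two outputs, yet the remix reconstructs both inputs perfectly.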

Source Separation with mask estimation neural networks
The decomposition of a sound mixture $x$ into several components can be approached as a mask estimation problem in the time-frequency domain, dividing the task into three stages: the transformation of the audio signal $x$ into a time-frequency representation with an encoder function $\mathcal{E}$, the computation of the mask for each source, $W_m$, and the reconstruction of the estimated sources $\hat{S}$ in the time domain with a decoder function $\mathcal{D}$. The encoder and decoder functions can either be pre-defined transformations (e.g. Short-Time Fourier Transform) or learnt from data.
The masks are weights that control the contribution of the mixture to each estimated source, so that each source $\hat{s}_m$ can be obtained by performing an element-wise multiplication ($\odot$) between the encoded input ($\mathcal{E}(x)$) and the corresponding mask $W_m$, and then decoding the result with the decoder function $\mathcal{D}$, as described in Eq (18).
An example of a mask estimation neural network for Source Separation is Conv-TasNet [31]. Its architecture is formed by a fully-convolutional separation module, with several repeats of convolutional blocks with increasing dilation factors. The masks are estimated with a pointwise convolution, and the encoder and decoder functions are learnt during training.
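The encode, mask, and decode steps can be sketched as follows. The identity encoder and decoder used here are placeholders chosen only to keep the example self-contained; they stand in for a Short-Time Fourier Transform or a learnt basis, and the masks are fixed by hand rather than estimated by a network.

```python
def separate_with_masks(mixture, masks, encode, decode):
    """Apply each mask element-wise to the encoded mixture, then decode."""
    encoded = encode(mixture)
    sources = []
    for mask in masks:
        masked = [w * e for w, e in zip(mask, encoded)]
        sources.append(decode(masked))
    return sources


def identity(values):
    """Placeholder transform standing in for a real encoder or decoder."""
    return list(values)


# Toy mixture and two hand-written masks that sum to one in every bin.
mixture = [0.5, -1.0, 2.0]
masks = [[1.0, 0.25, 0.0], [0.0, 0.75, 1.0]]

sources = separate_with_masks(mixture, masks, encode=identity, decode=identity)
reconstruction = [sum(values) for values in zip(*sources)]
print(sources, reconstruction)
```

Because the masks sum to one in every bin and the placeholder transforms are linear, the estimated sources add up to the input mixture, which illustrates the consistency constraint mentioned earlier.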

Metrics and evaluation of source separation systems
In order to evaluate the performance of Source Separation models, a standard approach is to measure the mean SNR improvement (SNRi, Eq (19)) of the estimated sources ($\hat{s}$) with respect to their corresponding reference signals ($s$), relative to the SNR obtained when using the mixture ($x$) as the estimate. In order to prevent divisions by zero and ensure numerical stability, a small positive quantity, $\epsilon$, is added to the denominators in Eq (19).

$$\mathrm{SNRi}(\hat{s}, s, x) = 10 \log_{10} \frac{\|s\|^2}{\|s - \hat{s}\|^2 + \epsilon} - 10 \log_{10} \frac{\|s\|^2}{\|s - x\|^2 + \epsilon} \quad (19)$$

Proposed methods

Joint Source Separation + Sound Event Detection
Considering the potential mutual benefits of the tasks of Source Separation and Sound Event Detection, we aim to combine both in a single system, called Joint SSep + SED (JSS). Such a system receives an audio mixture as input and computes the temporal boundaries of sound events, in the same manner as a traditional SED system, but with the difference that the predictions are obtained from automatically separated sources of the mixture, which are computed during the same inference process (Fig 5).
The proposed JSS systems consist, then, of a Source Separation block ($f^{(sep)}$) that divides the input mixture into $M$ sources (Eq (20)), and a Sound Event Detection block ($f^{(sed)}$) that obtains event score sequences ($\hat{D}_m$) for each of the estimated sources, considering $K$ event classes (Eq (21)).
Afterwards, the source-level scores are combined by means of a pooling function (Eq (22)), such as an average, obtaining mixture-level score sequences, $\hat{D}$.
In order to train the JSS model, we propose an iterative process. First, the Source Separation and the Sound Event Detection blocks are pre-trained separately. The SED block is pre-trained in the same manner as the DCASE SED Baseline system, using Mean Teacher semi-supervised training, whereas two different pre-training methods are considered for the SSep block: a supervised pre-training with PIT (Eq (12)), and an unsupervised pre-training with MixIT (Eq (17)).
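The inference path of a JSS system (separate, detect on each source, pool across sources) can be sketched as below, using average pooling. The `toy_separate` and `toy_detect` functions are hypothetical stand-ins for the trained blocks, written only so that the example runs; they do not reflect the actual network behaviour.

```python
def jss_forward(mixture, separate, detect):
    """Separate the mixture, score each source, and average the scores."""
    sources = separate(mixture)                             # M estimated sources
    source_scores = [detect(source) for source in sources]  # M matrices of K x T
    num_sources = len(source_scores)
    num_classes = len(source_scores[0])
    num_frames = len(source_scores[0][0])
    pooled = [
        [sum(scores[k][t] for scores in source_scores) / num_sources
         for t in range(num_frames)]
        for k in range(num_classes)
    ]
    return pooled, source_scores


def toy_separate(mixture):
    """Stand-in separator: split samples into positive and negative parts."""
    positive = [max(v, 0.0) for v in mixture]
    negative = [min(v, 0.0) for v in mixture]
    return [positive, negative]


def toy_detect(source):
    """Stand-in detector with K = 1 class: the score is the clipped magnitude."""
    return [[min(abs(v), 1.0) for v in source]]


mixture = [0.8, -0.4, 0.0]
pooled, per_source = jss_forward(mixture, toy_separate, toy_detect)
print(pooled, per_source)
```

Returning the source-level scores alongside the pooled ones is a design choice of this sketch: it makes it possible to inspect how event predictions are distributed across the estimated sources.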
Afterwards, taking the pre-trained blocks as a starting point, we compare two training methods. On the one hand, Joint Training (JT) performs a single training process, updating the weights of both blocks ($\theta_{sep}$ and $\theta_{sed}$) simultaneously. On the other hand, Two-stage Training (TST) updates each block in a separate stage.
In both JT and TST, the Mean Teacher method is applied, in order to deal with the different levels of annotation in the training data. The loss function employed is the SED objective $L_{sed}$, described in Eq (6).

Mean Teacher model selection
The aim of the training process is to find the optimal set of weights, $\theta^*$, that maximizes the performance in external data. For this purpose, the model is tested over a validation data set after each training epoch, computing an objective metric $P_{obj}$. When the training is finished, the set of weights that maximizes $P_{obj}$ is selected as the best model.
Moreover, the Mean Teacher training process updates two models in parallel: student and teacher. Therefore, at each training epoch $i$, two different sets of weights are available: $\theta^{(s)}_i$ for the student, and $\theta^{(t)}_i$ for the teacher. As described in Eq (3), the weights of the teacher are obtained as an exponential moving average of the student weights in previous steps, resulting in a smoother version of the student.
The DCASE SED Baseline model selection finds the best epoch, $b$, according only to the student models (Eq (23)). Then, at test time, the student and teacher models at epoch $b$, with weights $\theta^{(s)}_b$ and $\theta^{(t)}_b$ respectively, are evaluated over the dev-test data in terms of PSDS and $F_1$ scores. The decision whether to choose the student or the teacher model for external data (e.g. the DCASE evaluation dataset) can be made at this point, but it is usually observed that the teacher model gives better results.
Considering that teacher models often perform better than students at test time, even when selecting the best epoch with student models, the proposed criterion is expected to provide enhanced results. Moreover, the adequacy of model weight averaging for generalization in deep neural networks has already been discussed in other training techniques, such as Stochastic Weight Averaging [32], supporting the hypothesis that teacher models, constructed by averaging the weights of students (Eq (3)), provide enhanced robustness.
In order to compare the aforementioned settings, the two PSDS scenarios proposed by the DCASE Challenge are considered as performance metrics, as well as the event-based $F_1$ score. In addition to global performance, class-wise metrics are provided, so that the performance of each system for specific event categories can be compared.
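Epoch-level model selection amounts to an argmax over the validation metric recorded after each epoch, as the following sketch shows. The function name and the metric values are hypothetical; in practice the checkpoint saved at the selected epoch would be the one restored.

```python
def select_best_epoch(metric_per_epoch):
    """Return the index and value of the epoch with the highest validation metric."""
    best_epoch = max(range(len(metric_per_epoch)), key=metric_per_epoch.__getitem__)
    return best_epoch, metric_per_epoch[best_epoch]


# Hypothetical validation metric after each of four training epochs.
validation_metric = [0.31, 0.38, 0.42, 0.40]

best_epoch, best_value = select_best_epoch(validation_metric)
print(best_epoch, best_value)
```

When the metric is tied across epochs, this sketch keeps the earliest one; other tie-breaking rules are equally possible.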

Proposed experiments
Moreover, aiming to compare the proposed JSS models, which are composed of a single branch (Fig 5), with the SSep+SED Baseline of DCASE, which is a combination of two separate branches (Fig 3), we evaluate different model combinations computed as late fusions, averaging the mixture-level SED score sequences ($\hat{D}$). The combination procedure for $N$ models is described in Eq (25).

$$\hat{D} = \frac{1}{N} \sum_{n=1}^{N} \hat{D}^{(n)} \quad (25)$$

Datasets
DESED: Domestic Environment Sound Event Detection. DESED [9,10] is the Sound Event Detection dataset for DCASE Task 4, and it is composed of 10-second audio clips with different origins and types of annotations. According to the source of the audio, the available labels, and the purpose of the data, several subsets are defined:
• Weak training: 1578 recordings obtained from AudioSet, including the clip-level annotations for target events.
• Unlabeled training: 14412 clips obtained from AudioSet, with no annotations available.
• Synthetic training: 12500 artificial audio mixtures created by overlapping recordings of foreground events (obtained from FSD) and background sounds (from the SINS dataset [33]). The Scaper toolkit [34] is used to generate the mixtures, also providing their strong annotations.
• Validation: 1168 clips obtained from AudioSet, and manually annotated with strong labels.
• Public evaluation: 692 clips obtained from AudioSet, and manually annotated with strong labels.
The first three subsets (Weak, Unlabeled, and Synthetic) are intended for model training, whereas the Validation subset serves as development data during the challenge in order for participants to choose their best systems.The Public evaluation subset is an additional test dataset, which aims to assess whether the decisions made over the Validation set generalize to different data.
In addition to the described subsets of DESED, we consider a complementary test dataset: the Public overlap set, proposed in one of our previous analyses of the Sound Event Detection task [20], is designed to evaluate the performance of SED systems in conditions of severe overlap between different events, which have been shown to represent a challenging scenario for accurate event detection.In order to obtain the audio mixtures of the Public overlap set, the audio segments of the DESED Public evaluation set are randomly added together in pairs, joining their ground truth annotations accordingly.The Public overlap set is formed by three permutations of the Public evaluation segments, resulting in 2076 audio mixtures (three times the size of the Public evaluation set).
It should be noted that the Public overlap set is designed to represent artificial co-occurrence of sound events, which could have a different impact on the performance of the models compared to naturally overlapped sounds (i.e., several sound sources being recorded at the same time).Although natural overlap can be found in other DESED datasets, an analysis of performance under such kind of overlap would present certain limitations, due to the lack of annotations for non-target events and the scarcity of examples of overlap between target events [35].
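The construction of the Public overlap set can be sketched as follows. The pairing scheme used here (each clip is summed with the clip at the same position of a shuffled copy of the set, repeated for three shuffles) is an assumption about how the permutations are applied, and the function name, seed, and toy data are hypothetical.

```python
import random


def build_overlap_set(clips, annotations, num_permutations=3, seed=0):
    """Sum each clip with a randomly permuted partner and merge their annotations."""
    rng = random.Random(seed)
    indices = list(range(len(clips)))
    mixtures = []
    for _ in range(num_permutations):
        partners = indices[:]
        rng.shuffle(partners)  # one random permutation of the evaluation set
        for i, j in zip(indices, partners):
            audio = [a + b for a, b in zip(clips[i], clips[j])]
            labels = annotations[i] + annotations[j]  # join both event lists
            mixtures.append((audio, labels))
    return mixtures


# Toy data: four short "clips" with (class, onset, offset) annotations.
clips = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.3], [-0.2, 0.1]]
annotations = [
    [("Dog", 0.0, 1.0)],
    [("Speech", 0.5, 2.0)],
    [("Cat", 1.0, 1.5)],
    [("Dishes", 0.2, 0.4)],
]

overlap_set = build_overlap_set(clips, annotations)
print(len(overlap_set))
```

With the 692 clips of the Public evaluation set, three permutations yield 3 × 692 = 2076 mixtures, matching the size reported above.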
FUSS: Free Universal Source Separation. FUSS [30] is a Universal Source Separation dataset, built by means of audio overlapping. In order to obtain the artificial mixtures contained in FUSS, a background recording and one to three foreground recordings with different sound categories are summed together, in a similar way to DESED synthetic audios. The background and foreground recordings are obtained from FSD.
The dataset contains 20000 training mixtures, 1000 mixtures for validation (i.e., model selection) and 1000 for test. Given that all of them are artificially composed, the individual sources are available as training targets, allowing the use of supervised algorithms for Source Separation training.
YFCC100M: Yahoo-Flickr Creative Commons. YFCC100M [36] is a large-scale multimedia dataset formed by free-licensed videos and pictures obtained from web sources. Although individual sources of the audio from the nearly 800000 videos are not provided, the audio tracks can be used as Universal Source Separation training data by means of an unsupervised approach such as MixIT [13].
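To illustrate why MixIT does not need reference sources, the sketch below computes a simplified mixture-invariant loss. Two unlabeled mixtures are summed into a "mixture of mixtures", the separator estimates M sources from it, and the loss is the lowest remixing error over all assignments of estimated sources to the two original mixtures. The mean squared error used here is a stand-in chosen for brevity (an assumption, not the signal-level loss of [13]), and the function names are hypothetical.

```python
from itertools import product

def mse(x, y):
    """Mean squared error between two equal-length sample lists."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def mixit_loss(est_sources, mix1, mix2):
    """Lowest remix error over all assignments of sources to the two mixtures."""
    n = len(mix1)
    best = float("inf")
    # Each estimated source is assigned to mixture 0 or to mixture 1.
    for assignment in product((0, 1), repeat=len(est_sources)):
        remix = [[0.0] * n, [0.0] * n]
        for src, target in zip(est_sources, assignment):
            for t, sample in enumerate(src):
                remix[target][t] += sample
        loss = mse(remix[0], mix1) + mse(remix[1], mix2)
        best = min(best, loss)
    return best
```

The exhaustive search covers 2^M assignments, which stays small for the numbers of outputs considered here (16 for M = 4, 256 for M = 8).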

Model settings
Sound Event Detection baseline settings. Regarding the SED block, our models share their structure and settings with the SED Baseline of DCASE 2021 Task 4. This model is a CRNN with a convolutional stage of 7 layers and a recurrent stage of 2 Bidirectional Gated Recurrent Units (Bi-GRU) [23], implemented with PyTorch [37].
Mean Teacher training is employed, with a learning rate of $10^{-3}$ and an EMA factor of $\alpha_{ema} = 0.999$. Mixup data augmentation [38] and dropout regularization [39] are applied, with 0.5 probability each. Median filtering is applied to the output prediction scores with a filter length of 450 ms.
The SED baseline is fed with mel-spectrogram features of the audio segments (sampled at 16 kHz), with 128 mel filters, Hamming windows of L = 2048 samples, spaced by R = 256 samples, and Fast Fourier Transforms (FFT) of N = 2048 samples. This feature configuration is kept for the rest of the models.
In terms of data distribution, the training set is formed by the DESED Unlabeled, Weak and Synthetic data sets, from which 10% of the weak set and 20% of the synthetic set are reserved as validation data to perform model selection. The DESED Validation (dev-test) and Public Evaluation sets are used as test data.
In the SED Baseline, model selection is performed using the student model (Eq (23)). The objective metric $P_{obj}^{(Bs)}$ (Eq (26)) is the macro-averaged $F_1$ score obtained by the student over the validation subset, formed by weak and synthetic training segments. An intersection-based $F_1$ score is computed over synthetic data, whereas weak $F_1$ is computed for the weakly-labeled data.
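As a quick sanity check of the feature configuration, the sketch below derives the window and hop durations and the size of the resulting mel-spectrogram. The 10-second segment length and the centered (padded) framing convention are assumptions made for the example; the text only specifies the sampling rate, L, R, N and the number of mel filters.

```python
SAMPLE_RATE = 16_000  # Hz
WIN_LENGTH = 2048     # L, samples per Hamming window
HOP_LENGTH = 256      # R, samples between consecutive windows
N_FFT = 2048          # N, FFT size
N_MELS = 128          # number of mel filters

def feature_shape(duration_s, centered=True):
    """Return (n_mels, n_frames) for a segment of the given duration."""
    n_samples = int(round(duration_s * SAMPLE_RATE))
    if centered:
        # Signal padded so that frames are centered on multiples of the hop.
        n_frames = 1 + n_samples // HOP_LENGTH
    else:
        # Only frames that fit entirely inside the signal.
        n_frames = 1 + (n_samples - WIN_LENGTH) // HOP_LENGTH
    return N_MELS, n_frames

window_ms = 1000 * WIN_LENGTH / SAMPLE_RATE  # duration of one window
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE     # time step between frames
```

Under these assumptions each window spans 128 ms, frames are 16 ms apart, and a 10-second segment yields a 128 × 626 feature matrix.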
SSep+SED baseline settings. The SSep+SED Baseline involves pre-trained models for SSep and SED. Whereas the SED stage of the system is identical to the SED Baseline, the SSep model is an Improved Time-Domain Convolutional Network (TDCN++) [13], similar to ConvTas-Net, with M = 8 outputs. The model, implemented in TensorFlow, is trained in an unsupervised fashion using MixIT, employing 1600 hours of audio from YFCC100M as training data.
The score pooling function used in this baseline is a sum, meaning that the mixture-level scores for class k, $d_k$, are the sum of the predictions over the estimated sources, as described in Eq (27).
In this system, Source Separation is performed as an offline process. Therefore, the pre-trained SSep model is not updated or fine-tuned to the SED task or the DCASE data. The fine-tuning process relies on the same configuration as the SED Baseline, including mel-spectrogram features, Mean Teacher training and model selection.
Joint Source Separation + Sound Event Detection settings. In contrast with the SSep+SED Baseline, our proposed methods use a ConvTas-Net model for Source Separation. This model is implemented in PyTorch within the audio separation toolkit Asteroid [40], allowing for better integration with the SED Baseline. The structure of the ConvTas-Net is configured with M = 4 outputs, R = 1 repeat and X = 4 convolutional blocks.
The pre-training of the ConvTas-Net SSep block is performed either using PIT (supervised) or MixIT (unsupervised). In the case of supervised training, the FUSS dataset is used as training and validation data, whereas the unsupervised training with MixIT uses data from DESED.

In order to combine source-level scores into mixture-level scores, JSS models use a max-pooling function, defined in Eq (28).
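The two score pooling functions mentioned in this section (a sum over sources in the SSep+SED Baseline and a maximum over sources in JSS) can be sketched as below. The nested-list layout `source_scores[m][t][k]` (source, time frame, class) and the function name are assumptions made for the example; the sketch follows the verbal description of the pooling functions rather than any particular implementation.

```python
def pool_scores(source_scores, mode="max"):
    """Combine source-level scores [m][t][k] into mixture-level scores [t][k]."""
    if mode not in ("max", "sum"):
        raise ValueError("mode must be 'max' or 'sum'")
    reducer = max if mode == "max" else sum
    n_frames = len(source_scores[0])
    n_classes = len(source_scores[0][0])
    return [
        # Pool over the M sources for each time frame t and class k.
        [reducer(src[t][k] for src in source_scores) for k in range(n_classes)]
        for t in range(n_frames)
    ]
```

One practical difference is visible in the code: a maximum over scores in [0, 1] stays in [0, 1], whereas a plain sum over M sources can reach M.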
Model selection is applied either with the default method (student model selection, Eq (23)) or with the proposed teacher model selection (Eq (24)). In the latter case, the default objective function is computed using the teacher models ($\theta^{(t)}$), as is shown in Eq (29). Then, at test time, the best teacher model is used for inference.
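The difference between the two selection criteria can be illustrated with a minimal sketch. It assumes that models are represented as flat lists of parameters, that a snapshot of the student and of the teacher is stored after every epoch, and that `objective` is any validation metric to be maximized; these names and the data layout are hypothetical. The exponential moving average shown is the standard Mean Teacher update, with the EMA factor of 0.999 given in the text as default.

```python
def ema_update(teacher, student, alpha=0.999):
    """Mean Teacher update: teacher <- alpha * teacher + (1 - alpha) * student."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]

def select_best_epoch(models_per_epoch, objective):
    """Return (epoch index, model) maximizing the validation objective."""
    scores = [objective(model) for model in models_per_epoch]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, models_per_epoch[best]
```

Student model selection corresponds to calling `select_best_epoch` on the student snapshots, and teacher model selection to calling it on the teacher snapshots and keeping the returned teacher for inference.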

Results of individual models
The performance obtained by each of the proposed methods is measured in terms of PSDS and event-based $F_1$ score. Following the evaluation rules of the DCASE Challenge, the scenarios PSDS1 and PSDS2 are considered, and the macro-averaged event-based $F_1$ score is computed with a 200 ms collar for onsets and a collar length of max(200 ms, 0.2l) for offsets, where l is the length of the event.
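The collar criterion can be made concrete with a short sketch. It only checks whether a single estimated event matches a single reference event; a full event-based evaluation additionally requires a one-to-one assignment between estimated and reference events and the macro-averaging of the class-wise scores, which are omitted here. The tuple layout and function names are assumptions for illustration, and l is taken to be the length of the reference event.

```python
def event_matches(ref, est, onset_collar=0.2, offset_collar=0.2, offset_ratio=0.2):
    """Check the collar criterion for (onset_s, offset_s, label) events."""
    ref_on, ref_off, ref_label = ref
    est_on, est_off, est_label = est
    if ref_label != est_label:
        return False
    # Offset tolerance: max(200 ms, 20% of the reference event length).
    offset_tolerance = max(offset_collar, offset_ratio * (ref_off - ref_on))
    onset_ok = abs(est_on - ref_on) <= onset_collar
    offset_ok = abs(est_off - ref_off) <= offset_tolerance
    return onset_ok and offset_ok

def f1_score(n_tp, n_fp, n_fn):
    """F1 from true positive, false positive and false negative counts."""
    denominator = 2 * n_tp + n_fp + n_fn
    return 2 * n_tp / denominator if denominator else 0.0
```

For a 5-second reference event the offset tolerance grows to 1 s, while for events shorter than 1 s it stays at the 200 ms minimum.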
Table 2 shows the global performance of the SED Baseline and the proposed models over the DESED Validation and DESED Public evaluation sets, in terms of the three considered metrics. Since the DCASE SSep-SED baseline is in fact a combination of a SSep+SED system and a SED system (Fig 3), its results will be included next to the model combinations.
Most conclusions are similar for both datasets. Generally, the proposed teacher model selection provides better results than the default model selection, which is based on the student models. Although the change in the model selection strategy was motivated by the existence of several model selection decisions in the proposed methods, this improvement is also observed in the baseline system. Therefore, it is shown that the teacher model selection method is also beneficial for regular Sound Event Detection models trained with Mean Teacher. Regarding the two proposed pre-training methods for the Source Separation block, better performance is usually obtained when employing Mixture Invariant Training over the DESED dataset, underlining the ability of unsupervised source separation to leverage in-domain data when the individual target sources are not available. When comparing the performance of Two-stage Training and Joint Training (considering teacher model selection), the results show that Joint Training is slightly better at PSDS1, whereas Two-stage Training (at Stage 2) provides higher PSDS2, achieving in both cases slightly better performance than the SED baseline. The results of the different stages of TST give some insights about their impact on the final performance. First, the results of the initial state of the joint model, Stage 0, are considerably worse than the Baseline, indicating that the pre-trained SSep block introduces a domain mismatch with respect to the original mixtures. The first stage (S1), which fine-tunes the SED block, narrows this gap in performance, yielding results closer to the Baseline. Stage 2 is able to improve the results of S1 by fine-tuning the SSep block for the Sound Event Detection task.

Results of combined models
Aiming to allow a comparison with the SSep-SED Baseline of DCASE 2021, which is itself a combination of a SSep+SED system and a SED system, several model combinations are defined between the SED Baseline and the JSS models (with DESED SSep pre-training and teacher model selection). Their results are gathered in Table 3, showing that all the combinations outperform the SSep-SED baseline in terms of PSDS1 and PSDS2 over the Validation set. However, only the combinations which include both DESED-S2 and DESED-JT models are able to obtain a higher $F_1$ score than the SSep-SED baseline system. When observing the results over the Public evaluation set, most of the fusions obtain lower PSDS1 than the SSep-SED baseline, and none of them is able to obtain a higher $F_1$ score. Nonetheless, an improvement in PSDS2 is obtained by the combinations that include DESED-S1 or DESED-S2.

Results under event overlap
Source Separation as an auxiliary task for Sound Event Detection should be especially helpful in situations when several events coincide in the same lapse of time.For that reason, we assess the performance of the JSS models over the Public overlap set [20], providing the results in Table 4.
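The teacher model selection criterion and the unsupervised Source Separation pre-training with DESED are also beneficial in this scenario. However, the best results in PSDS1 and PSDS2 are obtained by the S1 models, and the best $F_1$ result is held by the S0 model. These behaviors could be explained by considering that the generation process of the dataset (artificially overlapping sound mixtures) and the unsupervised pre-training of the Source Separation block (separating mixtures of mixtures) can be seen as opposite operations. In other words, the pre-trained SSep block is already prepared to deal with the same kind of data that is present in the Public overlap set; therefore, the fine-tuning performed in Stage 2 or Joint Training does not provide any advantage in this scenario.

The results of the combined models over the Public overlap set are shown in Table 5. In this case, the SSep-SED Baseline yields the best PSDS1 result, while every combination that includes S1 or S2 obtains better PSDS2 than the baseline. The $F_1$ result of the baseline is only outperformed by the combinations that include the Stage 1 model. These results follow a similar behavior to those of the individual models.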

Further analysis and discussion
Although the results show that the proposed methods yield small but consistent improvements in the DCASE Sound Event Detection task, further analysis can provide a better understanding about the role and impact of Source Separation in SED. For this purpose, we study three additional aspects. First, the class-wise results of the JSS models, which allow us to analyse the impact of Source Separation in different kinds of events. Then, given that the main motivation for Source Separation as a pre-processing step for SED is to isolate different audio events in separate sources, we propose a metric that aims to assess to what extent this is achieved in the different proposed models. Finally, we provide graphical representations for some test examples, including the sources estimated by the SSep block as well as the source-level and mixture-level SED scores.

Class-wise results
The evaluation framework of DCASE Challenge Task 4 is mainly focused on the global performance of the models over the whole set of 10 event categories. However, the different target events are noticeably diverse in terms of acoustic characteristics, for instance, regarding their duration or their spectral properties [35]; therefore, some models could be better suited to detecting a certain subset of event categories [41]. This motivates a class-wise analysis of the results, aiming to better understand the behaviour of the JSS systems for different event classes. For this purpose, we provide the class-wise results of the proposed models (employing Teacher model selection) over the Public evaluation set in Figs 8 and 9.
The class-wise results suggest that the domain mismatch introduced by Source Separation, which severely impacts the performance of Stage 0 models, does not affect all classes. In fact, this initial state performs equally, or even slightly better than the SED baseline in some event categories (e.g., Cat, Dog in terms of PSDS1, Vacuum cleaner in terms of PSDS2). This indicates that the artifacts introduced by Source Separation do not alter the identification of these events. Nevertheless, the detection of other classes is noticeably degraded, especially Dishes. This is the event category for which the SED baseline yields its worst performance, meaning that its correct detection is particularly difficult. Thus, in this class the SED block is less robust to the domain mismatch introduced in Stage 0.
When comparing the two different approaches to SSep pre-training, some differences can be observed. For instance, in Stage 0 models, some classes obtain clearly lower results with FUSS supervised pre-training than with DESED unsupervised pre-training (e.g., Alarm bell

Diversity of predictions across sources
The motivation for the use of Source Separation as an auxiliary task for Sound Event Detection is the idea that automatically separated sources can be more adequate inputs for Sound Event Detection. This hypothesis can be assessed by measuring the SED performance of the systems, as done in the Results section. However, an analysis of the interaction between SSep and SED in the proposed models would be able to provide more specific insights.
For this purpose, we have studied the diversity of Sound Event Detection predictions across the M different sources estimated for each sound mixture. We consider the source-level score sequences, $D_m = \langle d_{m,k} \rangle_{k=1}^{K},\ m \in [1, M]$, and quantify the similarities between every source pair by means of cosine scoring. The similarity between the predictions of a pair of sources, $(m, n) \in [1, M]$, $m \neq n$, is computed in Eq (30) as the average of their cosine similarity at each time step t.
The total similarity score of an audio segment is computed as the average similarity of the SED scores for every pair of estimated sources. Given that the SED scores are bounded between 0 and 1, the similarity score is as well.
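A minimal sketch of this similarity score is given below, following the verbal description: the frame-wise cosine similarity between the K-dimensional score vectors of two sources is averaged over time, and the segment-level score is the average over all source pairs. The nested-list layout `source_scores[m][t][k]`, the function names and the small constant `eps` (which makes the similarity of an all-zero frame equal to 0) are assumptions of this example and are not taken from Eq (30).

```python
from itertools import combinations
from math import sqrt

def frame_cosine(u, v, eps=1e-8):
    """Cosine similarity between two class-score vectors of one time frame."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / max(norm, eps)

def pair_similarity(scores_m, scores_n):
    """Average over time of the frame-wise cosine similarity of two sources."""
    sims = [frame_cosine(u, v) for u, v in zip(scores_m, scores_n)]
    return sum(sims) / len(sims)

def segment_similarity(source_scores):
    """Average pair similarity over every pair of estimated sources."""
    pairs = list(combinations(range(len(source_scores)), 2))
    total = sum(pair_similarity(source_scores[m], source_scores[n]) for m, n in pairs)
    return total / len(pairs)
```

With M = 4 sources there are 6 pairs per segment. Identical predictions in every source give a score of 1, whereas sources that activate disjoint sets of classes at every frame give a score of 0.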
The distribution of similarity scores for test segments using different JSS models is shown in Fig 10. In general, higher similarity scores are observed for Public eval data than for Public overlap data, which is expected, considering that a larger number of events in a mixture allows for a higher diversity of events in its estimated sources. An additional general observation is

Regarding the different models, the least similar predictions across sources are obtained by the initial state, Stage 0, and the similarity of predictions increases with each step of Two-stage training (stages 1 and 2). The most similar predictions, however, are obtained with Joint Training. This behavior is particularly noticeable in the systems with unsupervised pre-training of SSep, which could indicate that the unsupervised SSep block is more prone to forgetting source separation, to some extent, when fine-tuned for the SED task.
Overall, the JSS model with the most similar predictions is DESED-JT, with most of the Public eval examples obtaining similarity scores higher than 0.9. In practice, this model is detecting almost the same events in every estimated source, meaning that Source Separation is not performing a relevant role.
A comparison of mean similarity scores and SED performances of the different proposed models is provided in Fig 11. Although the similarity scores do not provide a complete explanation of the differences in performance, it can be observed that, with the exception of Stage 0 systems in some cases, more diverse predictions across estimated sources generally lead to better results in Public overlap data, whereas this does not happen in Public eval data, suggesting that effective Source Separation is more beneficial when tackling severely overlapped scenarios. This explains the lower performance of Joint Training or Stage 2 in overlapped data, when compared to Stage 1 (or even Stage 0, in terms of $F_1$ score). However, this conclusion is limited due to the fact that the examples in Public overlap are artificially generated, which is an advantage for the SSep models employed (particularly for those pre-trained with DESED).
In conclusion, the proposed similarity score has allowed us to measure and observe the interactions between Source Separation and Sound Event Detection, highlighting the main difficulty of training SSep and SED systems jointly: when training JSS systems with a SED objective, the SSep block tends to forget its original task (decomposing the mixture into its different components), and provides instead similar signals for all its output channels. Although this issue does not necessarily harm SED performance (Fig 11), it does not align with the original motivation of JSS, which is to improve the detection performance by separating the input mixture into simpler components.

Intermediate representations of Joint Source Separation and Sound Event Detection
So far, we have analysed the SED performance of the JSS models, both globally and class-wise, and we have observed the effect of Source Separation in the SED predictions by measuring the diversity of the SED scores across the estimated sources. In this section, we aim to complement the analysis of JSS, providing a global overview of the method by representing the inputs and  positive activations of other classes: "Blender", "Alarm bell ringing" and, more noticeably, "Dog", which illustrates the effect of the domain mismatch.
After the fine-tuning of the SED block in Stage 1 (Fig 13), these false positives are resolved. In this stage, both events are detected in the four estimated sources, but the confidences in each one of them are different: the first two sources obtain high confidence for "Vacuum cleaner", but a moderate confidence for "Speech", even skipping one of its appearances. In contrast, a higher confidence for "Speech" is observed in the last two sources (which present cleaner speech mel-spectrograms), whereas the confidence for the detection of "Vacuum cleaner" is much lower. This suggests that the correct separation of events is helpful in order to enhance the confidence of the detections.

Conclusions and future work
In this work, we define and analyze a method for Sound Event Detection that includes Source Separation as an integral component of the neural network structure. With respect to other related works in the field, the proposed Joint Source Separation and Sound Event Detection (JSS) method allows an explicit interaction between SSep and SED, thanks to a joint model that is trained in an end-to-end fashion, built from two pre-trained neural networks, for SSep and SED respectively.
The experimental framework is configured according to the DCASE Challenge Task 4, "Sound Event Detection and Separation in Domestic Environments", providing results over the DESED Validation and Public evaluation sets. Moreover, given that the benefits of the SSep stage should be more evident when dealing with highly overlapped data, we offer results over an additional dataset containing sound mixtures severely affected by event overlap. In all of the studied data sets, the proposed models outperform the benchmark set by the DCASE SED Mean Teacher baseline in terms of two different PSDS scenarios.
Considering that SSep pre-training is required for JSS, the availability of in-domain SSep training data could become a limiting factor for the use of JSS. However, our experiments show that unsupervised SSep with Mixture Invariant Training is an adequate choice for the SSep pre-training step, even reaching better performance than supervised SSep pre-training over out-of-domain data. Additionally, our proposed model selection strategy for Mean Teacher, based on the teacher models, provides consistent improvements in SED performance at test time: this strategy is suitable not only for JSS models, but also for other Mean-Teacher-based SED models and, possibly, for Mean Teacher training in different applications.
Finally, aiming to provide further analysis and discussion of the role of Source Separation in the proposed models, we propose a study on the diversity of event predictions across different separated sources. The results have helped to highlight the main limitation of the proposed joint training methods: when training the Source Separation stage jointly with the Sound Event Detection stage using a SED objective, the SSep stage tends to forget how to separate sources and provides the same (or very similar) output for all the sources. In future work, we consider that the joint training process should be enhanced to deal with this limitation, for instance, combining the SED loss function with a Source Separation loss term in order to encourage separation and detection in a multi-task learning fashion.

Fig 1. Block diagram of Sound Event Detection (SED). A Sound Event Detection system computes, for an input audio mixture, the temporal boundaries of a set of K event categories. https://doi.org/10.1371/journal.pone.0303994.g001

Fig 2. Sample mel-spectrogram representations of the 10 different target event categories considered in DCASE Challenge Task 4. The represented audio segments are extracted from the DCASE Public evaluation set, considering segments in which a unique target event is annotated. https://doi.org/10.1371/journal.pone.0303994.g002

Fig 3. Block diagram of the DCASE 2021 Baseline system for Source Separation + Sound Event Detection (SSep-SED). The weight q is learnt during the training process. The parameters of the frozen blocks are not updated during the training process. https://doi.org/10.1371/journal.pone.0303994.g003

Fig 4. Block diagram of Source Separation (SSep). A Source Separation system decomposes the input audio mixture into a predefined number (M) of estimated sources. https://doi.org/10.1371/journal.pone.0303994.g004

Fig 6). Alternatively, Two-stage Training (TST) performs two additional training processes (Fig 7), the first one updating only $\theta_{sed}$ (Stage 1), and the second one updating only $\theta_{sep}$ (Stage 2). The main motivation for training the SSep and SED blocks together is that the separation artifacts introduced by the SSep block in audio signals create a domain mismatch with the pre-trained SED block, requiring a fine-tuning to separated sources.

Fig 6. Block diagram of Joint training for Joint Source Separation + Sound Event Detection. https://doi.org/10.1371/journal.pone.0303994.g006

proposed training methods for JSS, especially Two-stage Training, require several Mean Teacher training processes, each of them involving a model selection decision (Fig 7). Therefore, model selection is particularly relevant in JSS. In order to enhance the model selection criterion employed by the baseline, we propose a teacher model selection, which searches the best epoch $b^{(t)}$ considering the teacher models at each training epoch i (Eq (24)). In this manner, we select the best teacher model, with weights $\theta^{(t)}_{b^{(t)}}$, instead of the best student model, with weights $\theta^{(s)}_{b}$, according to $P_{obj}$.

$$b^{(t)} = \arg\max_{i} P_{obj}\left(\theta^{(t)}_{i}\right) \qquad (24)$$

Our experimental settings aim to compare the Sound Event Detection performance of the JSS methods (Joint Training and Two-stage Training) with two baseline systems, both proposed by the DCASE Challenge: a SED system and a SSep+SED system. Moreover, the experiments are designed to offer a comparison of the different methods described for JSS:

• Joint Training vs. Two-stage Training. Whereas JT requires fewer training processes, TST ensures an independent convergence for the SED and the SSep blocks, training each one of them in different stages.
• Supervised vs. Unsupervised Source Separation pre-training. The supervised pre-training for SSep requires a SSep dataset with oracle sources available, limiting the use of in-domain data. On the other hand, unsupervised pre-training with MixIT allows the use of in-domain data, even without reference sources available.
• Student model selection vs. Teacher model selection. In contrast with the default model selection criterion for Mean Teacher in the DCASE Baseline, our proposed model selection aims to provide enhanced robustness by taking into account the performance of the teacher models.
For the pre-training of the SSep block, the supervised training uses the FUSS dataset as training and validation data, whereas the unsupervised training with MixIT uses data from DESED. In particular, the DESED Synthetic and Unlabeled training sets are used for training, and the Weak training set is used for validation. The pre-training of the SED stage in JSS is the same as in the Baseline systems. The mel-spectrogram and Mean Teacher training settings are preserved for the subsequent stages of JSS, whereas the learning rate is decreased after each model selection: to $10^{-4}$ in Joint Training and Stage 1 of Two-stage Training, and to $10^{-5}$ in Stage 2.
For JSS, results are provided for the Joint Training (JT) and Two-stage Training (TST) methods, dividing TST into Stage 1 (S1) and Stage 2 (S2). The results of the initial state of the JSS model (a concatenation of the pre-trained blocks for SSep and SED) are included as Stage 0 (S0). The performances of the SED and the SSep+SED Baselines (SED Bs and SSep-SED Bs) are provided as benchmarks.

Fig 9. Class-wise PSDS2 results of individual models with teacher model selection over DESED Public evaluation data. The SED Baseline performance is indicated with a blue horizontal line. https://doi.org/10.1371/journal.pone.0303994.g009

Fig 11. PSDS1, PSDS2 and $F_1$ results of individual models with Teacher model selection over DESED Public evaluation (left) and Public overlap (right), plotted against their cosine-distance-based similarity scores. The SED Baseline is represented as a reference, considering that its similarity score is 1. https://doi.org/10.1371/journal.pone.0303994.g011

Fig 12. Visualization of the audio segment "AW9ZKFZKhDE_49_59", from DESED Public evaluation, as processed by the JSS model DESED-S0 (Teacher model selection). The top left plot is the input mixture waveform, and below the mixture mel-spectrogram is shown. The four bottom mel-spectrograms in the left column are the sources estimated by the SSep block. Next to each source mel-spectrogram, its corresponding SED score sequences are represented. Over the source-level scores, the mixture-level score sequences are shown. Finally, the upper right plot represents the ground truth annotations of the segment. The figure is best viewed in color. https://doi.org/10.1371/journal.pone.0303994.g012

Fig 13. Visualization of the audio segment "AW9ZKFZKhDE_49_59", from DESED Public evaluation, as processed by the JSS model DESED-S1 (Teacher model selection). The top left plot is the input mixture waveform, and below the mixture mel-spectrogram is shown. The four bottom mel-spectrograms in the left column are the sources estimated by the SSep block. Next to each source mel-spectrogram, its corresponding SED score sequences are represented. Over the source-level scores, the mixture-level score sequences are shown. Finally, the upper right plot represents the ground truth annotations of the segment. The figure is best viewed in color. https://doi.org/10.1371/journal.pone.0303994.g013

Fig 14. Visualization of the audio segment "AW9ZKFZKhDE_49_59", from DESED Public evaluation, as processed by the JSS model DESED-S2 (Teacher model selection). The top left plot is the input mixture waveform, and below the mixture mel-spectrogram is shown. The four bottom mel-spectrograms in the left column are the sources estimated by the SSep block. Next to each source mel-spectrogram, its corresponding SED score sequences are represented. Over the source-level scores, the mixture-level score sequences are shown. Finally, the upper right plot represents the ground truth annotations of the segment. The figure is best viewed in color. https://doi.org/10.1371/journal.pone.0303994.g014

Fig 15. Visualization of the audio segment "AW9ZKFZKhDE_49_59", from DESED Public evaluation, as processed by the JSS model DESED-JT (Teacher model selection). The top left plot is the input mixture waveform, and below the mixture mel-spectrogram is shown. The four bottom mel-spectrograms in the left column are the sources estimated by the SSep block. Next to each source mel-spectrogram, its corresponding SED score sequences are represented. Over the source-level scores, the mixture-level score sequences are shown. Finally, the upper right plot represents the ground truth annotations of the segment. The figure is best viewed in color. https://doi.org/10.1371/journal.pone.0303994.g015

Table 2. Sound Event Detection results obtained with the DCASE 2021 SED Baseline system (SED Bs) and the Joint Source Separation + Sound Event Detection proposed methods: Pre-trained model (S0), Two-stage Training (S1, S2), and Joint Training (JT). Results are provided over the DESED Validation (dev-test) and Public evaluation sets, in terms of PSDS1, PSDS2 and event-based $F_1$ score. Two pre-training methods for Source Separation (FUSS and DESED) and two model selection criteria (Student and Teacher) are compared. The best results for each metric/dataset are highlighted in bold. https://doi.org/10.1371/journal.pone.0303994.t002

DESED dataset, underlining the ability of unsupervised source separation to leverage in-domain data when the individual target sources are not available. When comparing the performance of Two-stage Training and Joint Training (considering teacher model selection), the results show that Joint Training is slightly better at PSDS1, whereas Two-stage Training (at Stage 2

Table 3. Sound Event Detection results obtained with the DCASE 2021 SSep+SED Baseline system (SSep-SED Bs) and JSS model fusions over the DESED Validation (dev-test) and Public evaluation sets, in terms of PSDS1, PSDS2, and event-based $F_1$ score. All the models in each fusion use Teacher model selection. The best results for each metric/dataset are highlighted in bold.

by considering that the generation process of the dataset (artificially overlapping sound mixtures) and the unsupervised pre-training of the Source Separation block (separating mixtures of mixtures) can be seen as opposite operations. In other words, the pre-trained SSep block is already prepared to deal with the same kind of data that is present in the Public overlap set, therefore the fine-tuning performed in Stage 2 or Joint Training does not provide any advantage in this scenario.

Table 4. Sound Event Detection results obtained with the DCASE 2021 SED Baseline system (SED Bs) and the Joint Source Separation + Sound Event Detection proposed methods: Initial model (S0), Two-stage Training (S1, S2), and Joint Training (JT). Results are provided over the DESED Public overlap set [20], in terms of PSDS1, PSDS2 and event-based $F_1$ score. Two pre-training methods for Source Separation (FUSS and DESED) and two model selection criteria (Student and Teacher) are compared. The best results for each metric are highlighted in bold.

Table 5. Sound Event Detection results obtained with the DCASE 2021 SSep+SED Baseline system (SSep-SED Bs) and JSS model fusions over the DESED Public overlap set, in terms of PSDS1, PSDS2, and event-based $F_1$ score. All the models in each fusion use Teacher model selection. The best results for each metric are highlighted in bold.